NSF PAR Search | NSF Public Access Repository

Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

Allen, C; Kirtland, A; Tao, R; Lobel, S; Scott, D; Petrocelli, N; Gottesman, O; Parr, R; Littman, M; Konidaris, G (December 2024, Advances in Neural Information Processing Systems)

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to—or knowledge of—an underlying, unobservable state space. Our metric, the λ-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD(λ) with a different value of λ. Since TD(λ=0) makes an implicit Markov assumption and TD(λ=1) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the λ-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the λ-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different λ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
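The idea in the abstract above can be illustrated on a toy problem. The following is a minimal tabular sketch, not the authors' implementation: it estimates values over observations with batch TD(λ) for two values of λ and reports the per-observation difference. The function names, the every-visit averaging, the fixed number of sweeps, and the two tiny episode sets are all assumptions made for illustration.

```python
from collections import defaultdict

def lambda_returns(episode, values, lam, gamma):
    """Lambda-returns for one episode given as a list of (observation, reward) steps."""
    targets = [0.0] * len(episode)
    next_target = 0.0  # lambda-return of the following step (0 after termination)
    next_value = 0.0   # current value estimate of the following observation
    for t in reversed(range(len(episode))):
        obs, reward = episode[t]
        # G_t = r_t + gamma * ((1 - lam) * V(o_{t+1}) + lam * G_{t+1})
        targets[t] = reward + gamma * ((1 - lam) * next_value + lam * next_target)
        next_target = targets[t]
        next_value = values.get(obs, 0.0)
    return targets

def td_lambda_values(episodes, lam, gamma=0.9, sweeps=200):
    """Batch TD(lambda): repeatedly set V(o) to the mean lambda-return seen at o."""
    values = {}
    for _ in range(sweeps):
        sums, counts = defaultdict(float), defaultdict(int)
        for episode in episodes:
            targets = lambda_returns(episode, values, lam, gamma)
            for (obs, _), target in zip(episode, targets):
                sums[obs] += target
                counts[obs] += 1
        values = {obs: sums[obs] / counts[obs] for obs in sums}
    return values

def lambda_discrepancy(episodes, lam_a=0.0, lam_b=1.0, gamma=0.9):
    """Per-observation difference between two TD(lambda) value estimates."""
    values_a = td_lambda_values(episodes, lam_a, gamma)
    values_b = td_lambda_values(episodes, lam_b, gamma)
    return {obs: values_a[obs] - values_b[obs] for obs in values_a}

# Hypothetical data: "Z" hides which branch the agent is on (aliased observation).
aliased = [[("X", 0.0), ("Z", 1.0)], [("Y", 0.0), ("Z", 0.0)]]
# Same dynamics, but the two branches emit distinct observations (Markov).
markov = [[("X", 0.0), ("Z1", 1.0)], [("Y", 0.0), ("Z2", 0.0)]]

print(lambda_discrepancy(aliased))
print(lambda_discrepancy(markov))
```

In the aliased case the TD(λ=0) estimate bootstraps through the shared observation "Z" and assigns "X" and "Y" the same value, while the TD(λ=1) estimate keeps them apart, so the discrepancy is about -0.45 at "X" and +0.45 at "Y" with the default discount of 0.9. In the Markov case the two estimates agree, which matches the abstract's claim for Markov decision processes.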
Full Text Available
DeepMellow: Removing the Need for a Target Network in Deep Q-Learning

https://doi.org/10.24963/ijcai.2019/379

Kim, S; Asadi, K; Littman, M; Konidaris, G (August 2019, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence)

Deep Q-Network (DQN) is an algorithm that achieves human-level performance in complex domains like Atari games. One of the important elements of DQN is its use of a target network, which is necessary to stabilize learning. We argue that using a target network is incompatible with online reinforcement learning, and it is possible to achieve faster and more stable learning without a target network when we use Mellowmax, an alternative softmax operator. We derive novel properties of Mellowmax, and empirically show that the combination of DQN and Mellowmax, but without a target network, outperforms DQN with a target network.
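For readers unfamiliar with the operator named above, here is a small sketch of Mellowmax and of a bootstrapped target that uses it in place of the hard max. Mellowmax is commonly defined as the log of the mean of exp(ω·x), divided by ω; the function names, the default ω and discount, and the `done` flag are illustrative assumptions, and the network and training loop from the paper are not reproduced.

```python
import math

def mellowmax(values, omega=5.0):
    """Mellowmax: log(mean(exp(omega * x))) / omega, for omega > 0.

    Shifting by the maximum keeps the exponentials from overflowing.
    """
    peak = max(values)
    mean_exp = sum(math.exp(omega * (v - peak)) for v in values) / len(values)
    return peak + math.log(mean_exp) / omega

def mellow_target(reward, next_q_values, gamma=0.99, omega=5.0, done=False):
    """One-step target that bootstraps with mellowmax instead of max."""
    if done:
        return reward
    return reward + gamma * mellowmax(next_q_values, omega)

# Hypothetical action values for a next state.
q_next = [1.0, 2.0, 3.0]
print(mellowmax(q_next, omega=5.0))
print(mellow_target(reward=0.5, next_q_values=q_next))
```

As ω grows the operator approaches the maximum of its inputs, and as ω shrinks toward zero it approaches their mean, so ω controls how soft the backup is.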
Full Text Available
State Abstractions for Lifelong Reinforcement Learning

Abel, D.; Arumugam, D.; Lehnert, L.; Littman, M. (January 2018, Proceedings of the 35th International Conference on Machine Learning)

Full Text Available
Policy and Value Transfer in Lifelong Reinforcement Learning

Abel, D.; Jinnai, Y.; Guo, Y.; Konidaris, G.; Littman, M. (January 2018, Proceedings of the 35th International Conference on Machine Learning)

Full Text Available
Learning Approximate Stochastic Transition Models

Song, Y.; Grimm, C.; Wang, X.; Littman, M. (January 2017, arXiv preprint arXiv:1710.09718)

Full Text Available
Showing versus doing: Teaching by demonstration

Ho, M. K.; Littman, M. L.; MacGlashan, J.; Cushman, F.; Austerweil, J. L. (January 2016, NeurIPS)

People often learn from others’ demonstrations, and inverse reinforcement learning (IRL) techniques have realized this capacity in machines. In contrast, teaching by demonstration has been less well studied computationally. Here, we develop a Bayesian model for teaching by demonstration. Stark differences arise when demonstrators are intentionally teaching (i.e. showing) a task versus simply performing (i.e. doing) a task. In two experiments, we show that human participants modify their teaching behavior consistent with the predictions of our model. Further, we show that even standard IRL algorithms benefit when learning from showing versus doing.
Full Text Available
